Abstract

Bernard Lang defines parsing as the cal- 
culation of the intersection of a FSA (the 
input) and a CFG. Viewing the input for 
parsing as a FSA rather than as a string 
combines well with some approaches in 
speech understanding systems, in which 
parsing takes a word lattice as input 
(rather than a word string). Furthermore, 
certain techniques for robust parsing can 
be modelled as finite state transducers. 

In this paper we investigate how we can
generalize this approach for unification
grammars. In particular we will concentrate
on how we might calculate the intersection of
a FSA and a DCG. It is shown that existing
parsing algorithms can be easily extended for
FSA inputs. However, we also show that the
termination properties change drastically: we
show that it is undecidable whether the
intersection of a FSA and a DCG is empty
(even if the DCG is off-line parsable).

Furthermore we discuss approaches to cope 
with the problem. 



1 Introduction 

In this paper we are concerned with the syntactic 
analysis phase of a natural language understanding 
system. Ordinarily, the input of such a system is a 
sequence of words. However, following Bernard Lang 
we argue that it might be fruitful to take the input 
more generally as a finite state automaton (FSA) 
to model cases in which we are uncertain about the 
actual input. Parsing uncertain input might be
necessary in case of ill-formed textual input, or in
case of speech input.



For example, if a natural language understanding
system is interfaced with a speech recognition
component, chances are that this component is uncertain
about the actual string of words that has been
uttered, and thus produces a word lattice of the most
promising hypotheses, rather than a single sequence
of words. FSAs of course generalize such word
lattices.

As another example, certain techniques to deal
with ill-formed input can be characterized as finite
state transducers (Lang, 1989); the composition of
an input string with such a finite state transducer
results in a FSA that can then be input for syntactic
parsing. Such an approach allows for the treatment
of missing, extraneous, interchanged or misused
words (Teitelbaum, 1973; Saito and Tomita, 1988;
Nederhof and Bertsch, 1994).



Such techniques might be of use both in the case 
of written and spoken language input. In the latter 
case another possible application concerns the treat- 
ment of phenomena such as repairs (Carter, 1994). 

Note that we allow the input to be a full FSA 
(possibly including cycles, etc.) since some of the 
above-mentioned techniques indeed result in cycles. 
Whereas an ordinary word-graph always defines a 
finite language, a FSA of course can easily define
an infinite number of sentences. Cycles might
emerge to treat unknown sequences of words, i.e.
sentences with unknown parts of unknown lengths
(Lang, 1988).

As suggested by an ACL reviewer, one could also 
try to model haplology phenomena (such as the 's in 
English sentences like 'The chef at Joe's hat', where 
'Joe's' is the name of a restaurant) using a finite 
state transducer. In a straightforward approach this 
would also lead to a finite-state automaton with 
cycles. 



It can be shown that the computation of the
intersection of a FSA and a CFG requires only a minimal
generalization of existing parsing algorithms. We
simply replace the usual string positions with the
names of the states in the FSA. It is also
straightforward to show that the complexity of this process
is cubic in the number of states of the FSA (in the
case of ordinary parsing the number of states equals
n+1) (Lang, 1974; Billot and Lang, 1989), assuming
the right-hand sides of grammar rules have at most
two categories.
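
The remark that ordinary parsing is the special case with n+1 states can be made concrete with a small sketch. The function name `string_to_fsa` and the tuple layout (source state, word, target state) are assumptions made here for illustration only; they mirror the argument order of trans/3.

```python
def string_to_fsa(words):
    """Encode a word sequence as a linear FSA whose states are string positions."""
    # One transition per word: position i --word--> position i + 1.
    trans = [(i, word, i + 1) for i, word in enumerate(words)]
    start, final = 0, len(words)
    return trans, start, final


trans, start, final = string_to_fsa(["a", "a", "b", "b"])
states = {start, final} | {p for (p, _, _) in trans} | {q for (_, _, q) in trans}
print(trans)        # the four transitions of the linear automaton
print(len(states))  # n + 1 states for a sentence of n words
```

A parser that only ever consults the transition relation cannot tell such a linear automaton from a general one, which is why state names can take over the role of string positions.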

In this paper we investigate whether the same
techniques can be applied in case the grammar is a
constraint-based grammar rather than a CFG. For
specificity we will take the grammar to be a
Definite Clause Grammar (DCG) (Pereira and Warren,
1980). A DCG is a simple example of a family
of constraint-based grammar formalisms that
are widely used in natural language analysis (and
generation). The main findings of this paper can
be extended to other members of that family of
constraint-based grammar formalisms.

2 The intersection of a CFG and a 
FSA 

The calculation of the intersection of a CFG and
a FSA is very simple (Bar-Hillel et al., 1961).
The (context-free) grammar defining this intersection
is simply constructed by keeping track of the
state names in the non-terminal category symbols.
For each rule $X_0 \rightarrow X_1 \ldots X_n$ there are rules
$\langle X_0\, q_0\, q \rangle \rightarrow \langle X_1\, q_0\, q_1 \rangle \langle X_2\, q_1\, q_2 \rangle \ldots \langle X_n\, q_{n-1}\, q \rangle$,
for all $q_0 \ldots q_n$. Furthermore for each transition
$\delta(q_i, \sigma) = q_k$ we have a rule
$\langle \sigma\, q_i\, q_k \rangle \rightarrow \sigma$. Thus the intersection
of a FSA and a CFG is a CFG that exactly derives
all parse-trees. Such a grammar might be called the
parse-forest grammar.

Although this construction shows that the
intersection of a FSA and a CFG is itself a CFG, it is
not of practical interest. The reason is that this
construction typically yields an enormous number
of rules that are 'useless'. In fact the (possibly
enormously large) parse forest grammar might define
an empty language (if the intersection was empty).
Luckily 'ordinary' recognizers/parsers for CFG can
be easily generalized to construct this intersection
yielding (in typical cases) a much smaller grammar.
Checking whether the intersection is empty or not is
then usually very simple as well: only in the latter
case will the parser terminate successfully.
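
To give an impression of how quickly the naive construction grows, the following sketch enumerates its rules for a small grammar for a^n b^n and a three-state automaton for (aa)*b+. The data layout and the name `bar_hillel` are assumptions made for this illustration; the construction itself follows the description above, choosing the states of each rule freely.

```python
from itertools import product

# Assumed encoding: a rule is (lhs, rhs) with rhs a tuple of symbol names;
# a transition is (source state, symbol, target state).
RULES = [("s", ("a", "s", "b")), ("s", ())]
STATES = ["q0", "q1", "q2"]
TRANS = [("q0", "a", "q1"), ("q1", "a", "q0"),
         ("q0", "b", "q2"), ("q2", "b", "q2")]


def bar_hillel(rules, states, trans):
    """Return the rules of the naive intersection grammar as (lhs, rhs) pairs."""
    out = []
    for lhs, rhs in rules:
        # Choose states q_0 ... q_n freely; symbol i spans (q_{i-1}, q_i).
        for qs in product(states, repeat=len(rhs) + 1):
            new_rhs = [(sym, qs[i], qs[i + 1]) for i, sym in enumerate(rhs)]
            out.append(((lhs, qs[0], qs[-1]), new_rhs))
    for (qi, sym, qk) in trans:
        # One terminal rule per transition of the automaton.
        out.append(((sym, qi, qk), [sym]))
    return out


grammar = bar_hillel(RULES, STATES, TRANS)
print(len(grammar))  # 81 + 3 + 4 = 88
```

Most of these 88 rules mention state pairs for which no transition or derivation exists, so they can never take part in a parse tree.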

To illustrate how a parser can be generalized to 
accept a FSA as input we present a simple top-down 
parser. 

A context-free grammar is represented as a
definite-clause specification as follows. We do not
wish to define the sets of terminal and non-terminal
symbols explicitly; these can be understood from the
rules that are defined using the relation rule/2,
where symbols of the rhs are prefixed with '-' in
the case of terminals and '+' in the case of
non-terminals. The relation top/1 defines the start
symbol. The language $L = a^n b^n$ is defined as:

top(s).

rule(s,[-a,+s,-b]).   rule(s,[]).

In order to illustrate how ordinary parsers can be 
used to compute the intersection of a FSA and a 
CFG consider first the definite-clause specification 
of a top-down parser. This parser runs in polynomial
time if implemented using Earley deduction or
XOLDT resolution (Warren, 1992). It is assumed
that the input string is represented by the trans/3
predicate.

parse(P0,P) :-
    top(Cat),
    parse(+Cat,P0,P).

parse(-Cat,P0,P) :-
    trans(P0,Cat,P),
    side_effect(p(Cat,P0,P) --> Cat).
parse(+Cat,P0,P) :-
    rule(Cat,Ds),
    parse_ds(Ds,P0,P,His),
    side_effect(p(Cat,P0,P) --> His).

parse_ds([],P,P,[]).
parse_ds([H|T],P0,P,[p(H,P0,P1)|His]) :-
    parse(H,P0,P1),
    parse_ds(T,P1,P,His).

The predicate side_effect is used to construct 
the parse forest grammar. The predicate always suc- 
ceeds, and as a side-effect asserts that its argument 
is a rule of the parse forest grammar. For the sen- 
tence 'a a b b' we obtain the parse forest grammar: 

p(s,2,2) --> [].
p(s,1,3) -->
    [p(-a,1,2),p(+s,2,2),p(-b,2,3)].
p(s,0,4) -->
    [p(-a,0,1),p(+s,1,3),p(-b,3,4)].
p(a,1,2) --> a.
p(a,0,1) --> a.
p(b,2,3) --> b.
p(b,3,4) --> b.

The reader easily verifies that indeed this grammar
generates (an isomorphism of) the single parse tree
of this example, assuming of course that the start
symbol for this parse-forest grammar is p(s,0,4).
In the parse-forest grammar, complex symbols are
non-terminals, atomic symbols are terminals.

Figure 1: A parse-tree extracted from the parse forest grammar

Next consider the definite clause specification
of a FSA. We define the transition relation using
the relation trans/3. For start states, the relation
start/1 should hold, and for final states the relation
final/1 should hold. Thus the following FSA,
defining the regular language $L = (aa)^* b^+$ (i.e. an
even number of a's followed by at least one b), is
given as:




start(q0).         final(q2).

trans(q0,a,q1).    trans(q1,a,q0).
trans(q0,b,q2).    trans(q2,b,q2).



Interestingly, nothing needs to be changed to use 
the same parser for the computation of the intersec- 
tion of a FSA and a CFG. If our input 'sentence' now 
is the definition of trans/3 as given above, we ob- 
tain the following parse forest grammar (where the 
start symbol is p(s,q0,q2)): 

p(s,q0,q0) --> [].
p(s,q1,q1) --> [].
p(s,q1,q2) -->
    [p(-a,q1,q0),p(+s,q0,q0),p(-b,q0,q2)].
p(s,q0,q2) -->
    [p(-a,q0,q1),p(+s,q1,q2),p(-b,q2,q2)].
p(s,q1,q2) -->
    [p(-a,q1,q0),p(+s,q0,q2),p(-b,q2,q2)].
p(a,q0,q1) --> a.
p(a,q1,q0) --> a.
p(b,q0,q2) --> b.
p(b,q2,q2) --> b.

Thus, even though we now use the same parser for
an infinite set of input sentences (represented by
the FSA) the parser still is able to come up with
a parse forest grammar. A possible derivation for
this grammar constructs the (abbreviated) parse tree
in figure 1. Note that the construction of
Bar-Hillel would have yielded a grammar with 88
rules.
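
The contrast with the 88 rules can be reproduced with a small sketch. It is not the top-down parser given above: to guarantee termination on cyclic automata without tabling, it computes derivable items bottom-up until a fixpoint is reached and then keeps only the rules reachable from the start symbol. The function name `parse_forest` and the tuple encoding are assumptions made for this illustration, and terminal and non-terminal names are assumed to be disjoint.

```python
def parse_forest(rules, trans, start, final, top):
    """Bottom-up closure, then top-down pruning; items are (symbol, p, q)."""
    states = {start, final} | {p for p, _, _ in trans} | {q for _, _, q in trans}
    forest = {((a, p, q), (a,)) for p, a, q in trans}   # terminal rules
    items = {lhs for lhs, _ in forest}                  # derivable items

    def spans(rhs, p):
        """Yield tuples of known items that cover rhs starting in state p."""
        if not rhs:
            yield ()
            return
        for item in list(items):
            if item[0] == rhs[0] and item[1] == p:
                for rest in spans(rhs[1:], item[2]):
                    yield (item,) + rest

    changed = True
    while changed:                                      # iterate to a fixpoint
        changed = False
        for lhs, rhs in rules:
            for p in states:
                for body in list(spans(rhs, p)):
                    q = body[-1][2] if body else p
                    rule = ((lhs, p, q), body)
                    if rule not in forest:
                        forest.add(rule)
                        items.add(rule[0])
                        changed = True

    # Keep only rules reachable from the start symbol of the forest grammar.
    useful, todo, seen = set(), [(top, start, final)], set()
    while todo:
        item = todo.pop()
        if item in seen:
            continue
        seen.add(item)
        for lhs, body in forest:
            if lhs == item:
                useful.add((lhs, body))
                todo.extend(x for x in body if isinstance(x, tuple))
    return useful


RULES = [("s", ("a", "s", "b")), ("s", ())]
TRANS = [("q0", "a", "q1"), ("q1", "a", "q0"),
         ("q0", "b", "q2"), ("q2", "b", "q2")]
forest = parse_forest(RULES, TRANS, "q0", "q2", "s")
print(len(forest))  # 8
```

Under these assumptions the sketch yields 8 rules rather than the 9 listed above: the rule for p(s,q1,q1) is predicted by the top-down parser but is not used in any derivation from p(s,q0,q2), so the pruning step drops it. If the intersection is empty, the returned set is empty as well.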

3 The intersection of a DCG and a 
FSA 

In this section we want to generalize the ideas de- 
scribed above for CFG to DCG. 

First note that the problem of calculating the
intersection of a DCG and a FSA can be solved trivially
by a generalization of the construction by Bar-Hillel
et al. (1961). However, if we use that method
we will end up (typically) with an enormously large
forest grammar that is not even guaranteed to
contain solutions (see footnote 1). Therefore, we are
interested in methods that only generate a small subset
of this; e.g. if the intersection is empty we want an
empty parse-forest grammar.

The straightforward approach is to generalize
existing recognition algorithms. The same techniques
that are used for calculating the intersection of a
FSA and a CFG can be applied in the case of DCGs.
In order to compute the intersection of a DCG and
a FSA we assume that FSA are represented as
before. DCGs are represented using the same notation
we used for context-free grammars, but now of
course the category symbols can be first-order terms
of arbitrary complexity (note that without loss of
generality we don't take into account DCGs having
external actions defined in curly braces).

Footnote 1: In fact, the standard compilation of DCG into
Prolog clauses does something similar using variables
instead of actual state names. This also illustrates that
this method is not very useful yet; all the work has still
to be done.

But if we use existing techniques for parsing
DCGs, then we are also confronted with an
undecidability problem: the recognition problem for DCGs
is undecidable (Pereira and Warren, 1983). A fortiori
the problem of deciding whether the intersection
of a FSA and a DCG is empty or not is
undecidable.

This undecidability result is usually circumvented
by considering subsets of DCGs which can be
recognized effectively. For example, we can restrict the
attention to DCGs of which the context-free skeleton
does not contain cycles. Recognition for such
'off-line parsable' grammars is decidable (Pereira and
Warren, 1983).

Most existing constraint-based parsing algorithms 
will terminate for grammars that exhibit the prop- 
erty that for each string there is only a finite number 
of possible derivations. Note that off-line parsability 
is one possible way of ensuring that this is the case. 

This observation is not very helpful in establishing
insights concerning interesting subclasses of
DCGs for which termination can be guaranteed (in
the case of FSA input). The reason is that there
are now two sources of recursion: in the DCG and
in the FSA (cycles). As we saw earlier: even for
CFG it holds that there can be an infinite number
of analyses for a given FSA (but in the CFG case this
of course does not imply undecidability).



3.1 Intersection of FSA and off-line 
parsable DCG is undecidable 

I now show that the question whether the
intersection of a FSA and an off-line parsable DCG is empty
is undecidable. A yes-no problem is undecidable (cf.
Hopcroft and Ullman (1979), pp. 178-179) if there is
no algorithm that takes as its input an instance of
the problem and determines whether the answer to
that instance is 'yes' or 'no'. An instance of a
problem consists of a particular choice of the parameters
of that problem.

I use Post's Correspondence Problem (PCP) as a
well-known undecidable problem. I show that if the
above-mentioned intersection problem were decidable,
then we could solve the PCP too. The following
definition and example of a PCP are taken from
Hopcroft and Ullman (1979), chapter 8.5.

An instance of PCP consists of two lists,
$A = v_1 \ldots v_k$ and $B = w_1 \ldots w_k$, of strings over some
alphabet $\Sigma$. This instance has a solution if there is
any sequence of integers $i_1 \ldots i_m$, with $m \geq 1$, such
that

$v_{i_1} v_{i_2} \ldots v_{i_m} = w_{i_1} w_{i_2} \ldots w_{i_m}$.

The sequence $i_1, \ldots, i_m$ is a solution to this
instance of PCP. As an example, assume that
$\Sigma = \{0,1\}$. Furthermore, let A = (1, 10111, 10) and
B = (111, 10, 0). A solution to this instance of
PCP is the sequence 2,1,1,3 (obtaining the sequence
101111110). For an illustration, cf. figure 3.
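
The example can be checked mechanically. The helper names `concat` and `is_solution` are introduced here for illustration only; indices are 1-based, as in the definition.

```python
A = ("1", "10111", "10")
B = ("111", "10", "0")


def concat(strings, indices):
    """Concatenate the strings selected by 1-based indices."""
    return "".join(strings[i - 1] for i in indices)


def is_solution(a, b, indices):
    """A solution is a non-empty index sequence giving equal concatenations."""
    return len(indices) >= 1 and concat(a, indices) == concat(b, indices)


solution = (2, 1, 1, 3)
print(concat(A, solution))  # 10111 + 1 + 1 + 10
print(concat(B, solution))  # 10 + 111 + 111 + 0
print(is_solution(A, B, solution))
```

Both concatenations come out as 101111110, the sequence mentioned above.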

Clearly there are PCPs that do not have a
solution. Assume again that $\Sigma = \{0,1\}$. Furthermore
let A = (1) and B = (0). Clearly this PCP does
not have a solution. In general, however, the
problem whether some PCP has a solution or not is not
decidable. This result is proved by Hopcroft and
Ullman (1979) by showing that the halting problem
for Turing Machines can be encoded as an instance
of Post's Correspondence Problem.

First I give a simple algorithm to encode any
instance of a PCP as a pair, consisting of a FSA and
an off-line parsable DCG, in such a way that the
question whether there is a solution to this PCP is
equivalent to the question whether the intersection
of this FSA and DCG is non-empty.

Encoding of PCP.

1. For each $1 \leq i \leq k$ (k the length of lists A
and B) define a DCG rule (the i-th member
of A is $a_1 \ldots a_m$, and the i-th member of B is
$b_1 \ldots b_n$): $r([a_1 \ldots a_m | A], A, [b_1 \ldots b_n | B], B) \rightarrow [x]$.

2. Furthermore, there is a rule
$r(A_0, A, B_0, B) \rightarrow r(A_0, A_1, B_0, B_1),\ r(A_1, A, B_1, B)$.

3. Furthermore, there is a rule $s \rightarrow r(X, [\,], X, [\,])$.
Also, s is the start category of the DCG.

4. Finally, the FSA consists of a single state q
which is both the start state and the final state,
and a single transition $\delta(q, x) = q$. This FSA
generates $x^*$.

Observe that the DCG is off-line parsable. 

The underlying idea of the algorithm is really very
simple. For each pair of strings from the lists A
and B there will be one lexical entry (deriving the
terminal x) where these strings are represented by
a difference-list encoding. Furthermore there is a
general combination rule that simply concatenates
A-strings and concatenates B-strings. Finally the
rule for s states that in order to construct a successful
top category the A and B lists must match.

The resulting DCG, FSA pair for the example
PCP is given in figure 4.






Figure 2: Instance of a PCP problem.

A: 10111 1 1 10 = 101111110
B: 10 111 111 0 = 101111110

Figure 3: Illustration of a solution for the PCP problem of figure 2.



trans(q0,x,q0).  start(q0).  final(q0).        % FSA

top(s).                                        % start symbol DCG

rule(s,[+r(X,[],X,[])]).                       % require A's and B's match

rule(r(A0,A,B0,B),[+r(A0,A1,B0,B1),            % combine two sequences
                   +r(A1,A,B1,B)]).            % of blocks

rule(r([1|A],A,[1,1,1|B],B),[-x]).             % block A1/B1

rule(r([1,0,1,1,1|A],A,[1,0|B],B),[-x]).       % block A2/B2

rule(r([1,0|A],A,[0|B],B),[-x]).               % block A3/B3

Figure 4: The encoding for the PCP problem of figure 2.



Proposition The question whether the intersection
of a FSA and an off-line parsable DCG is empty
is undecidable.

Proof. Suppose the problem were decidable. In
that case there would exist an algorithm for solving
the problem. This algorithm could then be used
to solve the PCP, because a PCP $\pi$ has a solution if
and only if the intersection of its encoding given above
as a FSA and an off-line parsable DCG is not empty.
The PCP problem however is known to be undecidable.
Hence the intersection question is undecidable too.
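
The reduction also suggests what a parser can still do: in the encoding, a derivation of x^m corresponds to a choice of m blocks, so enumerating m = 1, 2, ... finds a solution whenever one exists but never terminates otherwise. The sketch below makes this explicit with an artificial bound; the name `bounded_pcp` and the bound `max_len` are assumptions introduced for illustration, and the brute-force enumeration merely stands in for the work a parser would do.

```python
from itertools import product


def bounded_pcp(a, b, max_len):
    """Return the first index sequence of length <= max_len that solves the
    instance (a, b), or None if there is none within the bound."""
    k = len(a)
    for m in range(1, max_len + 1):          # m corresponds to the input x^m
        for seq in product(range(1, k + 1), repeat=m):
            left = "".join(a[i - 1] for i in seq)
            right = "".join(b[i - 1] for i in seq)
            if left == right:
                return seq
    return None


A = ("1", "10111", "10")
B = ("111", "10", "0")
print(bounded_pcp(A, B, 4))         # (2, 1, 1, 3)
print(bounded_pcp(A, B, 3))         # None: no solution with fewer than four blocks
print(bounded_pcp(("1",), ("0",), 6))
```

A None result only says that no solution exists within the bound. Because the cyclic FSA accepts x^m for every m, no finite bound is justified in general, which is exactly where undecidability enters.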

3.2 What to do? 

The following approaches towards the undecidability 
problem can be taken: 

• limit the power of the FSA 

• limit the power of the DCG 

• compromise completeness 

• compromise soundness 

These approaches are discussed now in turn. 

Limit the FSA Rather than assuming the input 
for parsing is a FSA in its full generality we might 
assume that the input is an ordinary word graph (a 
FSA without cycles). 

Thus the techniques for robust processing that
give rise to such cycles cannot be used. One example
is the processing of an unknown sequence of words,
e.g. in case there is noise in the input and it is not
clear how many words have been uttered during
this noise. It is not clear to me right now what we
lose (in practical terms) if we give up such cycles.

Note that it is easy to verify that the question 
whether the intersection of a word-graph and an off- 
line parsable DCG is empty or not is decidable since 
it reduces to checking whether the DCG derives one 
of a finite number of strings. 
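
This reduction can be sketched directly. The word graph below, the name `lattice_strings` and the recognizer passed to `intersection_nonempty` are hypothetical; any terminating recognizer for the off-line parsable DCG could take the place of the stand-in used here. The enumeration assumes an acyclic automaton and would not terminate on a cyclic one.

```python
def lattice_strings(trans, start, finals):
    """Enumerate all word sequences of an acyclic FSA by depth-first search."""
    out = []

    def walk(state, words):
        if state in finals:
            out.append(tuple(words))
        for (p, w, q) in trans:
            if p == state:
                walk(q, words + [w])

    walk(start, [])
    return out


def intersection_nonempty(trans, start, finals, recognise):
    """Decide non-emptiness by testing each of the finitely many strings."""
    return any(recognise(s) for s in lattice_strings(trans, start, finals))


# Hypothetical word graph with one uncertain word.
LATTICE = [(0, "the", 1), (1, "chef", 2), (1, "chief", 2), (2, "cooks", 3)]
print(lattice_strings(LATTICE, 0, {3}))
# Stand-in recognizer: accepts exactly one of the two hypotheses.
print(intersection_nonempty(LATTICE, 0, {3},
                            lambda s: s == ("the", "chef", "cooks")))
```

The number of strings can be exponential in the size of the word graph, so this only illustrates decidability and is not meant as a practical parsing method.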

Limit the DCG Another approach is to limit the
size of the categories that are being employed. This
is the GPSG and F-TAG approach. In that case we
are no longer dealing with DCGs but rather with
CFGs (which have been shown to be insufficient in
general for the description of natural languages).

Compromise completeness Completeness in 
this context means: the parse forest grammar con- 
tains all possible parses. It is possible to compromise 
here, in such a way that the parser is guaranteed to 
terminate, but sometimes misses a few parse-trees. 



For example, if we assume that each edge in the 
FSA is associated with a probability it is possible to 
define a threshold such that each partial result that 
is derived has a probability higher than the thres- 
hold. Thus, it is still possible to have cycles in the 
FSA, but anytime the cycle is 'used' the probability 
decreases and if too many cycles are encountered the 
threshold will cut off that derivation. 

Of course this implies that sometimes the intersec- 
tion is considered empty by this procedure whereas 
in fact the intersection is not. For any threshold it 
is the case that the intersection problem of off-line 
parsable DCGs and FSA is decidable. 
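
A minimal sketch of such a cut-off, restricted to the FSA side, is given below. The probabilities, the threshold and the name `strings_above` are hypothetical. Termination relies on the assumptions that the threshold is positive and that every edge has a probability strictly below 1, so that each further step lowers the probability of a path.

```python
def strings_above(trans, start, finals, threshold):
    """Enumerate (words, probability) for accepted paths whose probability
    stays above the threshold; trans holds (source, word, prob, target)."""
    out, todo = [], [(start, (), 1.0)]
    while todo:
        state, words, prob = todo.pop(0)     # breadth-first: short paths first
        if state in finals:
            out.append((words, prob))
        for (p, w, pr, q) in trans:
            if p == state and prob * pr > threshold:
                todo.append((q, words + (w,), prob * pr))
    return out


# The cyclic automaton of the PCP encoding, with an assumed edge probability.
LOOP = [("q0", "x", 0.5, "q0")]
for words, prob in strings_above(LOOP, "q0", {"q0"}, 0.1):
    print(len(words), prob)
```

With these numbers only x^0 up to x^3 survive, since 0.5^4 = 0.0625 falls below the threshold. The solution 2,1,1,3 of the example PCP needs four blocks and would therefore be missed, which illustrates the loss of completeness.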

Compromise soundness Soundness in this con- 
text should be understood as the property that all 
parse trees in the parse forest grammar are valid 
parse trees. A possible way to ensure termination is 
to remove all constraints from the DCG and parse 
according to this context-free skeleton. The result- 
ing parse-forest grammar will be too general most of 
the times. 

A practical variation can be conceived as fol- 
lows. From the DCG we take its context-free skele- 
ton. This skeleton is obtained by removing the con- 
straints from each of the grammar rules. Then we 
compute the intersection of the skeleton with the in- 
put FSA. This results in a parse forest grammar. 
Finally, we add the corresponding constraints from 
the DCG to the grammar rules of the parse forest 
grammar. 

This has the advantage that the result is still 
sound and complete, although the size of the parse 
forest grammar is not optimal (as a consequence it is 
not guaranteed that the parse forest grammar con- 
tains a parse tree). Of course it is possible to exper- 
iment with different ways of taking the context-free 
skeleton (including as much information as possible 
/ useful). 
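
One simple way of taking the skeleton, sketched below, replaces every category by its functor and arity. This is only one of the possible choices mentioned above; the tuple encoding of terms and the names `skeleton_symbol` and `skeleton` are assumptions made for illustration. Right-hand-side symbols carry the '+' and '-' prefixes used for rule/2.

```python
def skeleton_symbol(term):
    """Map a term to functor/arity, dropping its arguments."""
    if isinstance(term, tuple):
        return f"{term[0]}/{len(term) - 1}"
    return f"{term}/0"


def skeleton(rules):
    """rules: list of (lhs_term, [(prefix, term), ...]) where prefix is '+'
    for non-terminals and '-' for terminals; terminals are kept as they are."""
    out = set()
    for lhs, rhs in rules:
        body = tuple(term if prefix == "-" else skeleton_symbol(term)
                     for prefix, term in rhs)
        out.add((skeleton_symbol(lhs), body))
    return sorted(out)


# The DCG of the PCP encoding; arguments are opaque strings in this sketch.
R = ("r", "A0", "A", "B0", "B")
PCP_DCG = [
    ("s", [("+", ("r", "X", "[]", "X", "[]"))]),
    (R, [("+", ("r", "A0", "A1", "B0", "B1")), ("+", ("r", "A1", "A", "B1", "B"))]),
    (("r", "[1|A]", "A", "[1,1,1|B]", "B"), [("-", "x")]),
    (("r", "[1,0,1,1,1|A]", "A", "[1,0|B]", "B"), [("-", "x")]),
    (("r", "[1,0|A]", "A", "[0|B]", "B"), [("-", "x")]),
]
for rule in skeleton(PCP_DCG):
    print(rule)
```

For this grammar the three lexical entries collapse into a single skeleton rule, so the skeleton accepts every non-empty sequence of x's. All the information that distinguishes solutions from non-solutions sits in the constraints that are added back afterwards.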